ABSTRACT

Distilling knowledge from a well-trained cumbersome network
to a small one has recently become a new research topic, as
lightweight neural networks with high performance are
particularly needed in various resource-restricted systems.
This paper addresses the problem of distilling word
embeddings for NLP tasks. We propose an encoding approach to
distill task-specific knowledge from a set of
high-dimensional embeddings, which can reduce model
complexity by a large margin while retaining high accuracy,
showing a good compromise between efficiency and performance.
Experiments in two tasks reveal the phenomenon that
distilling knowledge from cumbersome embeddings is better
than directly training neural networks with small embeddings.
1. INTRODUCTION

Distilling knowledge from a neural network (that is,
transferring valuable knowledge from a cumbersome network to
a lightweight one) was pioneered by Bucilua et al.; it has
attracted increasing attention over the last two years.

As addressed by Hinton et al., the objective of training
networks is probably different from that of deploying
networks: during training we focus on extracting as much
knowledge as possible from a large dataset, whereas deploying
networks takes into consideration multiple aspects, including
accuracy, memory, time, and energy consumption. It would be
appealing if we could first train a cumbersome network well
offline, and then distill its knowledge to a small one for
deployment. The aim of knowledge distillation is thus to
reduce model complexity while retaining high performance,
which is particularly important to neural networks’
applications in resource-restricted scenarios, e.g.,
real-time systems, mobile devices, and large ensembles of
models.

Much evidence in the literature shows the feasibility of
transferring knowledge from one neural network to another,
for instance, from shallow networks to deep ones [11], from
feed-forward networks to recurrent ones, or vice versa. The
main idea of the above studies is to train a teacher model
first, and then use the teacher model’s output (estimated
probabilities by softmax, say, in a classification problem)
to guide a student model. Several variants of training
objectives include applying regression over the input of
softmax and softening the teacher model’s probabilities. We
call such approaches “matching softmax” (Figure 1a). It is
also argued that the estimated probabilities by a teacher
model convey more information than one-hot represented
ground truth; hence knowledge distillation is feasible and
beneficial.

Despite the above generic approach, this paper focuses on
distilling word embeddings in NLP applications.
Particularly, we find the specificity of embeddings brings
new opportunities for knowledge distillation.

As word embeddings map discrete words to distributed,
real-valued vectors, a word can be viewed as first being
represented as a one-hot vector, which is then multiplied by
a large embedding matrix, known as a look-up table. During
the matrix-vector multiplication, one and only one column in
the look-up table is retrieved verbatim for a particular
word. Thus, we may build an interlayer, sandwiched between
the high-dimensional embeddings and the ensuing network, to
squash embeddings to a low-dimensional space (Figure 1b).
The standard cross-entropy loss can then be applied to train
the encoding layer and other parameters in the network. In
such a supervised manner, task-specific knowledge in the
original cumbersome embeddings can be distilled to
low-dimensional ones.

In summary, the main contributions of this paper are
three-fold: (1) We address the problem of distilling word
embeddings in NLP applications. (2) We propose a supervised
encoding approach to distill task-specific knowledge from
cumbersome word embeddings. (3) Our experimental results in
sentiment analysis and relation classification tasks reveal
a phenomenon that distilling low-dimensional embeddings from
large ones is better than directly training a network with
small embeddings.

It should also be noted that the proposed encoding approach
does not rely on a teacher model; our method is
complementary to existing matching softmax for knowledge
distillation.
Figure 1: Distilling knowledge by (a) matching softmax, and (b) encoding embeddings.


2. BACKGROUND OF MATCHING SOFTMAX


As said, existing approaches to knowledge transfer between
two neural networks mainly follow a two-step strategy: first
training a teacher network; then using the teacher network
to guide a student model by matching softmax, depicted in
Figure 1a.

For a classification problem, softmax is typically used as
the output layer’s activation function. Let
$\mathbf{z} \in \mathbb{R}^{n_c}$ be the input of softmax.
($n_c$ is the number of classes.) The output of softmax is

$$y_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{n_c} \exp(z_j / T)}$$

where $T$ is a relaxation variable (used later), called
temperature. $T = 1$ for standard softmax.

Take a 3-way classification problem as an example. If a
teacher model estimates
$\mathbf{y} = (0.95, 0.04, 0.01)^\top$ for three classes, it
is valuable information to the student model that Class 2 is
more similar to Class 1 than Class 3 is to Class 1.

However, directly imposing constraints on the output of
softmax may be ineffective: the difference between 0.04 and
0.01 is too small. Ba et al. match the input of softmax,
$\mathbf{z}$, rather than $\mathbf{y}$. Hinton et al. raise
the temperature $T$ during training, which makes the
estimated probabilities softer over different classes. A
temperature of 3, for instance, softens the above
$\mathbf{y}$ to $(0.64, 0.22, 0.14)^\top$. Matching softmax
can also be applied along with standard cross-entropy loss
(with one-hot ground truth), or more elaborately, the
teacher model’s effect declines in an annealing fashion when
the student model is more aware of data [11].
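The softening effect of the temperature can be checked numerically. The sketch below is a minimal illustration, assuming NumPy; the function name `softmax_with_temperature` is hypothetical, and the logits are taken to be the logarithms of the teacher probabilities from the 3-way example (any constant shift of the logits yields the same softmax).

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Softmax over logits z, softened by temperature T (T = 1 is standard)."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Teacher probabilities from the 3-way example; their logs serve as logits.
y_teacher = np.array([0.95, 0.04, 0.01])
z = np.log(y_teacher)

y_standard = softmax_with_temperature(z, T=1.0)  # recovers y_teacher
y_soft = softmax_with_temperature(z, T=3.0)      # roughly (0.64, 0.22, 0.14)
```

With T = 3 the gap between the second and third classes becomes much easier for a student model to match than the original 0.04 versus 0.01.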


3. THE PROPOSED ENCODING APPROACH 
FOR DISTILLING EMBEDDINGS 

This section introduces in detail our proposed method for
word embedding distillation (Figure 1b). We also analyze the
neural network’s model capacity with distilled embeddings,
and discuss the rationale for distilling small embeddings
from large ones in a supervised manner, instead of directly
training with small embeddings.

Word embeddings are a standard component for neural natural
language processing. As feeding word indexes directly to
neural networks is somewhat nonsensical, each word is mapped
to a real-valued vector, called an embedding, where each
dimension captures a certain aspect of underlying word
semantics. Usually, embeddings are trained in an
unsupervised fashion, e.g., maximizing the probability of a
large corpus, or maximizing a scoring function. The learned
embeddings can be fed to standard neural networks for
supervised learning, e.g., POS tagging, named entity
recognition, and semantic role labeling.

To formalize word embeddings in algebraic notation, we let
$\mathbf{x}_i \in \mathbb{R}^{|\mathcal{V}|}$ be the one-hot
representation of the $i$-th word $x_i$ in the vocabulary
$\mathcal{V}$; the $i$-th element in the vector
$\mathbf{x}_i$ is on, with other elements being 0. Let
$\Phi_c \in \mathbb{R}^{n_\text{embed} \times |\mathcal{V}|}$
be the (cumbersome) embedding matrix (look-up table). Then
the vector representation of the word is exactly the $i$-th
column of the matrix, given by $\Phi_c \cdot \mathbf{x}_i$.

Now we consider distilling, from cumbersome embeddings
$\Phi_c \cdot \mathbf{x}_i$, an $n_\text{distill}$-dimensional
vector for the word, where $n_\text{distill}$ is smaller than
$n_\text{embed}$. It is accomplished by encoding with a
non-linear neural layer, i.e.,

$$\operatorname{vec}(x_i) = f(W_\text{encode} \cdot \Phi_c \mathbf{x}_i + b_\text{encode}) \qquad (1)$$

where $W_\text{encode} \in \mathbb{R}^{n_\text{distill} \times n_\text{embed}}$
and $b_\text{encode} \in \mathbb{R}^{n_\text{distill}}$ are
parameters of the encoding layer, $f$ is the layer’s
non-linear activation, and $\operatorname{vec}(\cdot)$
denotes the distilled vector representation of a word.
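The encoding step in Equation 1 can be sketched in a few lines. This is an illustrative sketch only, assuming NumPy: the vocabulary size, the random initialization, and the choice of tanh for the non-linearity f are assumptions made here, not details taken from the paper. It also shows that multiplying the look-up table by a one-hot vector is the same as retrieving one column.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, n_embed, n_distill = 1000, 300, 50  # vocab_size is hypothetical

# Cumbersome embedding matrix Phi_c: one column per word.
Phi_c = rng.normal(scale=0.1, size=(n_embed, vocab_size))
# Encoding-layer parameters W_encode and b_encode.
W_encode = rng.normal(scale=0.1, size=(n_distill, n_embed))
b_encode = np.zeros(n_distill)

def one_hot(i, size):
    """One-hot vector with the i-th element on and all others 0."""
    x = np.zeros(size)
    x[i] = 1.0
    return x

def distill(i, f=np.tanh):
    """vec(x_i) = f(W_encode . Phi_c x_i + b_encode); f = tanh is an assumption."""
    cumbersome = Phi_c[:, i]  # same as Phi_c @ one_hot(i, vocab_size)
    return f(W_encode @ cumbersome + b_encode)

i = 42
lookup_matches = np.allclose(Phi_c @ one_hot(i, vocab_size), Phi_c[:, i])
vec_i = distill(i)  # 50-dimensional distilled embedding of word i
```

In practice the column is indexed directly rather than multiplied out, since the one-hot product touches every entry of the look-up table.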

These distilled embeddings can then be fed to a neural
network (with parameters $\Theta$) for further processing.
Let $m$ be the number of data samples and $n_c$ be the number
of target classes; suppose further that $\mathbf{y}^{(j)}$ is
the output of softmax for the $j$-th data sample and
$\mathbf{t}^{(j)}$ the one-hot represented ground truth. Our
training objective is the standard cross-entropy loss, given
by

$$\underset{W_\text{encode},\, b_\text{encode},\, \Theta,\, \Phi_c}{\text{minimize}} \; -\sum_{j=1}^{m} \sum_{i=1}^{n_c} t_i^{(j)} \log y_i^{(j)}$$
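As a small worked instance of the cross-entropy objective, the sketch below evaluates the double sum for two toy samples and three classes. It assumes NumPy; the function name `cross_entropy`, the toy probabilities, and the small constant `eps` (added only to avoid log 0) are illustrative choices, not part of the original formulation.

```python
import numpy as np

def cross_entropy(targets_onehot, probs, eps=1e-12):
    """Sum over samples j and classes i of -t_i^(j) * log y_i^(j)."""
    return -np.sum(targets_onehot * np.log(probs + eps))

# Two samples (rows), three classes (columns); toy values.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = np.array([[1, 0, 0],
                    [0, 1, 0]])

loss = cross_entropy(targets, probs)  # equals -(log 0.7 + log 0.8)
```

Because the targets are one-hot, only the probability assigned to the correct class of each sample contributes to the loss.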

We would like to point out that distilling embeddings does
not increase, or in fact may reduce, model capacity vis-a-vis
directly training with small embeddings, despite the large
number of parameters in cumbersome embeddings and the
encoding layer’s weights.

Theorem 1. The model capacity of a neural network with 
distilled embeddings is less than or equal to that of a neural 
network trained directly with small embeddings. 

Proof. The intuition is straightforward: small embeddings
are free parameters which are not constrained, whereas the
distilled embeddings are subject to the form in Equation 1.
Formally, let $\mathcal{H}_d, \mathcal{H}_s$ be the
hypothesis classes of networks with distilled/small
embeddings, respectively. For each $h_d \in \mathcal{H}_d$
with cumbersome embeddings $\Phi_c$ and encoding parameters
$W_\text{encode}, b_\text{encode}$, there exists a hypothesis
$h_s \in \mathcal{H}_s$ satisfying $h_s = h_d$, with small
embeddings $\Phi_s$ whose $i$-th column (the small embedding
for the $i$-th word) is
$f(W_\text{encode} \Phi_c \mathbf{x}_i + b_\text{encode})$.
Hence, $\mathcal{H}_d \subseteq \mathcal{H}_s$.

A curious question is then why distilling embeddings may
help, compared with directly training the neural network
with small embeddings. We provide an intuitive explanation
as follows.

Since word embeddings are typically learned from a large
corpus in an unsupervised manner, the knowledge in
embeddings is restrained by dimensionality. For example, the
sentiment of a word is of secondary importance compared with
its syntactic functionality in a sentence. Hence, sentiment
information might be lost in low-dimensional embeddings,
which is unfavorable in a sentiment analysis task. On the
contrary, large embeddings have the capacity to capture
different aspects of word semantics. The proposed supervised
encoding approach may then distill task-specific (e.g.,
sentiment) knowledge to a small space, while eliminating
irrelevant information. Therefore, we may reasonably expect
that distilling embeddings would outperform direct use of
small ones.

Deployment Issues 

Before deploying the model, we shall precompute the
distilled embeddings, $\operatorname{vec}(\cdot)$, according
to Equation 1 after training all parameters. The original
embeddings $\Phi_c$ and encoding parameters
($W_\text{encode}$ and $b_\text{encode}$) can then be safely
discarded, and we obtain a small model (dashed rectangle in
Figure 1b) with a set of small embeddings, which are
distilled from large ones.

As we shall see in the experiments, the small model will be
very computationally efficient because we have removed a
large number of parameters.
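The precomputation step can be sketched as a single matrix operation over the whole vocabulary. This is a minimal sketch under stated assumptions, using NumPy: the parameters are random stand-ins for trained values, the vocabulary size is hypothetical, and tanh stands in for the unspecified non-linearity f. The resulting matrix `Phi_s` plays the role of the small embeddings in the proof of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_embed, n_distill = 1000, 300, 50  # vocab_size is hypothetical

# Stand-ins for trained parameters (random here, purely for illustration).
Phi_c = rng.normal(size=(n_embed, vocab_size))
W_encode = rng.normal(scale=0.05, size=(n_distill, n_embed))
b_encode = rng.normal(scale=0.05, size=n_distill)

# Precompute vec(.) for every word at once: column i is the distilled
# embedding of word i. tanh is an assumed choice for f.
Phi_s = np.tanh(W_encode @ Phi_c + b_encode[:, None])

# After this step only Phi_s needs to be kept for deployment.
params_before = Phi_c.size + W_encode.size + b_encode.size
params_after = Phi_s.size

# The deployed look-up agrees with the full encoding path.
i = 7
on_the_fly = np.tanh(W_encode @ Phi_c[:, i] + b_encode)
same = np.allclose(Phi_s[:, i], on_the_fly)
```

At inference time the network then reads columns of `Phi_s` directly, so neither the large table nor the encoding layer is evaluated.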

4. EVALUATION 

In this section, we present our experimental results. We
first describe the testbed and protocol of our experiments
in Subsection 4.1. Then we analyze in Subsection 4.2 the
performance of our approach regarding several aspects,
namely accuracy, memory, and time consumption.

4.1 Tasks, Models, and Protocols 

We tested our distilling approach in two tasks: sentiment
analysis and relation classification.

The sentiment analysis task aims to classify a sentence
into 5 categories according to its sentiment: strongly/weakly
positive/negative and neutral. We used the Stanford
Sentiment Treebank^1 as our dataset, which contains
8544/1101/2210 sentences for training, validation, and
testing. Phrases (sub-sentences) in the training set are
also labeled with sentiment, enriching the training set to
more than 150k samples. For validation and testing, only the
sentiment of a whole sentence was considered.

The second task is to classify the relation between two
tagged entities in a sentence. The SemEval 2010 dataset^2 we
used comprises 8000 training samples, from which we split
10% for validation; there are an additional 3000 samples for
testing. Target labels include 9 directed relations (e.g.,
Component-Whole) plus a default Other; in total, we have 19
classes. The official F1-score was applied as our
measurement.

^1 http://nlp.stanford.edu/sentiment/

Task                  Method              Acc.   #Param          Time
Sentiment analysis    Cumbersome embed.   51.6   6.9M            1x
by TBCNN              Small embed.        46.4   0.94M (0.14x)   0.04x
                      Distilled embed.    47.5   0.94M (0.14x)   0.04x
                      Matching softmax    45.8   0.94M (0.14x)   0.04x

Table 1: Comparison between cumbersome embeddings, small
embeddings, matching softmax, and distilled embeddings. The
official measure is accuracy (acc.) in percentage.

Task                  Method              F1     #Param          Time
Relation              Cumbersome embed.   82.1   8.8M            1x
classification        Small embed.        79.0   1.3M (0.15x)    0.04x
by SDP-LSTM           Distilled embed.    79.4   1.3M (0.15x)    0.04x
                      Matching softmax    80.1   1.3M (0.15x)    0.04x
                      Hybrid              80.2   1.3M (0.15x)    0.04x

Table 2: Comparison between cumbersome embeddings, small
embeddings, matching softmax, and distilled embeddings. We
further made an attempt to combine matching softmax and
distilling embeddings (denoted as “Hybrid”). The official
measure for relation classification is the F1-score.

To set up our experiments, we leveraged two state-of-the-art
neural models: a tree-based convolutional neural network
(TBCNN) for sentiment analysis, and a long short term
memory-based recurrent network along the shortest dependency
path (SDP-LSTM)^3 between two entities for relation
classification [13].

For each task, we evaluated our proposed methods by
distilling 300-dimensional embeddings to 50 dimensions,
further processed by a thin network (also 50d). In
comparison, we trained the 50d network directly with small
50d embeddings. All models were trained by mini-batch
gradient descent with back-propagation. For both settings of
distilling and non-distilling, we tried extensive
configurations of hyperparameters, mainly following the
original papers.^4 After choosing the setting with the
highest validation accuracy, we ran each model 5 times for
smoothing with different random initializations, and report
the average test accuracy or F1-score.

4.2 Results 

Tables 1 and 2 present the results of our proposed model as
well as two competing settings: training a wider network
with cumbersome embeddings, and directly training a thin
network with small embeddings.

^2 http://semeval2.fbk.eu/semeval2.php?location=data
^3 We only used word embeddings, and ignored other features
like hyponymy and dependency types, which were used in [13].
In this way, we focus on the problem of embedding
distillation itself.
^4 Due to the limitation of space, we list candidate
configurations on our website:
https://sites.google.com/site/distillembeddings/

In both experiments, cumbersome embeddings yield the highest
performance, the distilled embeddings rank second, and small
embeddings are worst. Our method outperforms direct training
of a small network by a margin of approximately one standard
deviation (std = 1.1 and 0.6, respectively). As the results
were obtained by averaging over 5 initializations, we deem
the improvement fair.

Regarding model complexity, our distilled embeddings reduce
memory and time to a large extent, to 14-15% and 4%,
respectively (C++ implementation on a single CPU).
Therefore, the resulting network is significantly more
lightweight, which is helpful when deploying neural networks
in applications.
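The memory ratios and accuracy gains quoted above follow directly from the numbers reported in Tables 1 and 2. The short sketch below simply redoes that arithmetic; the dictionary layout and variable names are illustrative, while the values are copied from the two tables.

```python
# Values copied from Tables 1 and 2 (parameter counts in millions).
params = {
    "sentiment (TBCNN)": {"cumbersome": 6.9, "small": 0.94},
    "relation (SDP-LSTM)": {"cumbersome": 8.8, "small": 1.3},
}
scores = {
    "sentiment (TBCNN)": {"small": 46.4, "distilled": 47.5},
    "relation (SDP-LSTM)": {"small": 79.0, "distilled": 79.4},
}

# Fraction of parameters kept by the 50d models, and the gain of
# distilled embeddings over directly trained small embeddings.
memory_ratio = {task: p["small"] / p["cumbersome"] for task, p in params.items()}
gain = {task: s["distilled"] - s["small"] for task, s in scores.items()}

for task in params:
    print(f"{task}: memory ratio {memory_ratio[task]:.2f}, "
          f"gain over small embeddings {gain[task]:.1f}")
```

The ratios round to 0.14 and 0.15, matching the 14-15% memory figure, and the gains are 1.1 accuracy points and 0.4 F1 points.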

To further test our method under extreme conditions, we
distilled word embeddings to 10d and 30d. We chose to
conduct the experiments in the second task, because it is of
lower variance. As demonstrated in Figure 2, our method
consistently outperforms direct training with small
embeddings in all scenarios; moreover, the margin increases
when the dimension becomes small. Such a result is
consistent with our intuition, and verifies the conjecture
in Section 3: small embeddings contain less knowledge
specific to the task of interest; the proposed supervised
encoding approach can distill task-specific knowledge from
large embeddings.

We also notice that our approach is, in fact, complementary
to existing matching softmax methods: the encoding layer
distills task-specific knowledge from large embeddings in a
bottom-up fashion, whereas matching softmax distills generic
knowledge in a top-down fashion.

In both tasks, we also tried the matching softmax approach,
whose settings and hyperparameters are mainly derived from
previous work, i.e., $T = 2$ and a 1:1 mixture of ground
truth and the teacher model’s output. Its performance is not
consistent: in the sentiment analysis task, matching softmax
hurts the performance by 0.6%, whereas it improves the
relation classification task by 1.1%. (See Tables 1 and 2
again.) One plausible explanation is that the teacher model
itself has not achieved remarkable accuracy (only about 50%)
in the 5-way sentiment classification task. Using a teacher
model introduces additional knowledge as well as errors. If
the latter dominates, matching softmax may hurt the student
model. However, our encoding approach to embedding
distillation does not rely on a teacher model. Combined with
matching softmax, it improves by another 0.1% (although this
may not be large) in the second experiment, showing that the
two methods can potentially be combined, as they are
complementary to each other.
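One way to read the matching softmax setting (T = 2 with a 1:1 mixture of ground truth and teacher output) is as an equal-weight sum of two cross-entropy terms. The sketch below follows that reading and assumes NumPy; how exactly the mixture was applied in the experiments is not stated in the text, so the function `matching_softmax_loss`, the weighting of loss terms, and the toy logits are all assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def matching_softmax_loss(student_logits, teacher_logits, target_onehot,
                          T=2.0, eps=1e-12):
    """Equal-weight mix of hard-label and teacher-matching cross-entropy.

    Assumption: the 1:1 mixture is applied to the two loss terms, and
    teacher and student logits are softened with the same temperature T.
    """
    hard = -np.sum(target_onehot * np.log(softmax(student_logits) + eps))
    soft_targets = softmax(teacher_logits, T)
    soft = -np.sum(soft_targets * np.log(softmax(student_logits, T) + eps))
    return 0.5 * hard + 0.5 * soft

# Toy logits for a 3-way problem (illustrative values only).
student = np.array([2.0, 0.5, -1.0])
teacher = np.array([3.0, 1.0, -2.0])
target = np.array([1.0, 0.0, 0.0])

loss = matching_softmax_loss(student, teacher, target)
loss_if_matched = matching_softmax_loss(teacher, teacher, target)
```

If the teacher's soft targets are themselves inaccurate, the second term pulls the student toward those errors, which is consistent with the explanation given for the sentiment analysis result.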

5. CONCLUSION 

In this paper, we addressed the problem of distilling
embeddings for NLP, which is important when deploying a
neural network in resource-restricted scenarios. We proposed
an encoding approach that distills cumbersome word
embeddings to a low-dimensional space. Experimental results
have shown the superiority of our proposed distilling method
to training neural networks directly with small embeddings;
the performance gain increases especially when the dimension
becomes small. Moreover, our approach does not rely on a
teacher model, and is thus complementary to matching
softmax; these two methods of knowledge distillation could
also be combined.




Figure 2: Accuracy versus dimension in the experiment of
relation classification (legend: small embedding; distilled
embedding).